Left join could use bitmap for left join instead of Vec<bool>#342
Left join could use bitmap for left join instead of Vec<bool>#342boaz-codota wants to merge 1 commit into
Conversation
|
@Dandandan I hope I understood the requested change correctly. Was not familiar with bitvec before, but I used the docs and I think I implemented it correctly |
Yes, this is awesome, exactly what I meant! Impressive how little is changed. I will do some benchmarking tonight or tomorrow to see if it's changing anything (but I believe this part is not the most performance sensitive anyway). @alamb @andygrove @jorgecarleitao |
Codecov Report
@@ Coverage Diff @@
## master #342 +/- ##
==========================================
- Coverage 75.72% 75.71% -0.01%
==========================================
Files 143 143
Lines 23832 23835 +3
==========================================
+ Hits 18046 18047 +1
- Misses 5786 5788 +2
Continue to review full report at Codecov.
|
|
Thanks a lot @boaz-codota ! Is there any performance difference? imo It would be good to measure before adding a dependency. Note that arrow also has an |
The idea is here is not performance per se, but memory savings. As this is an in-memory join, any saved memory is welcome, as it will increase the chance a given join will fit in memory. Using a boolean array buffer builder makes sense to me to avoid the dependency. |
There was a problem hiding this comment.
Given we already have Arrow as dependency which has bitmaps, I like @jorgecarleitao 's idea of using code from that crate instead (such as https://docs.rs/arrow/4.0.0/arrow/bitmap/struct.Bitmap.html).
|
@alamb @jorgecarleitao I'm trying to replace the usage of bitvec with arrow's structures. Both |
|
I agree both the datastructures don't seem very useful for this use case. I don't think there is a mutable structure like bitvec in Arrow... |
|
It is just buried in the catacombs of the |
|
Ah, sorry, I see; you need mutable access to an item in the middle of the builder; yeap, that is currently not supported. It assumes that it is always pushed/advanced. |
|
Yes, I think access to the middle only could be done by accessing / mutating the contents, in which case you are already close to rolling your own bitvec API. |
|
To be fair, it is fairly easy to enable that on the I find the bitvec a bit overkill for a single use-case of a bitmap, specially when we already have a "bitvec" on arrow, but I am also fine pushing this through and leave this problem for later. |
|
I also am hesitant to add a new dependency to DataFusion without some sort of compelling performance or memory improvement |
|
I'll try to tackle adding the set_bit to Arrow's |
|
I'm stuck on this, I've implemented the code which I think that should work, but it is not working as expected. I opened the PR to get help. |
|
FYI arrow 4.3.0 has been released with the code in BooleanBufferBuilder |
|
I'll refactor this as soon as I have some free time. Thanks @alamb ! |
|
@alamb tried to refactor the code now, noticed that there is no |
I am not sure
One thing that might be helpful would be to prototype out this PR using a local checkout / fork of the arrow-rs repo so that you can get everything working and find other potential "gotchas" without having to wait on the arrow-rs release cycle Once you figured out what you needed then we could merge it according in parts |
|
Implemented in an arrow native way here: #884 |
Which issue does this PR close?
Closes #240 .
Rationale for this change
Described in the issue.
What changes are included in this PR?
Described in the issue.
Are there any user-facing changes?
No user facing changes.